11/19/2020

Agenda

  • Introduction
  • K-means
  • Hierarchical clustering
  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)

Recap

Unsupervised learning

We have seen many models for \(y \mid x\). This is called supervised learning: we explain the relationship between \(x\) and a chosen response variable \(y\).

Now we will see models for \(x\) alone.

  • This generally means that the analysis goals are not clearly defined and we cannot directly validate the findings (there is no response variable, so no analogue of RSS)
  • However, there are still many important questions that we can consider:
    • Finding lower-dimensional representations
    • Finding clusters / groups

Principal Components Analysis (PCA)

Main goal

  • Principal component analysis (PCA) produces a low-dimensional representation of a dataset
  • Basic idea: Find a low-dimensional representation that approximates the data as closely as possible in Euclidean distance


  • It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated
  • It can produce variables for use in supervised learning problems (PCA regression), and it can also serve as a tool for data visualization

Example

[Figures illustrating PCA on a two-variable example dataset; not reproduced here]

PCA: details

The first principal component (PC1) of a set of features \(X_1,\dots, X_p\) is the normalized linear combination of the features \[Z_1 = \varphi_{11} X_1 + \varphi_{21} X_2 +\dots + \varphi_{p1} X_p\] that has the largest variance. By normalized, we mean that \(\sum_{j=1}^{p} \varphi_{j1}^2 = 1\)


In this example:

  • \(Z_{1} = \frac{\sqrt{2}}{2} X_{1} + \frac{\sqrt{2}}{2} X_{2}\)
  • \(Z_{2} = - \frac{\sqrt{2}}{2} X_{1} + \frac{\sqrt{2}}{2} X_{2}\)

PCA: loadings

  • We refer to the elements \(\varphi_{11}, \dots, \varphi_{p1}\) as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \(\boldsymbol{\varphi}_1 = (\varphi_{11}, \dots, \varphi_{p1})^\intercal\)
  • The loading vector \(\boldsymbol{\varphi}_1\) defines a direction in feature space along which the data vary the most


In this example:

  • \(\boldsymbol{\varphi}_{1} = \left( \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)^{\intercal}\)
  • \(\boldsymbol{\varphi}_{2} = \left( - \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)^{\intercal}\)
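As a minimal sketch of how these loading vectors arise (assuming NumPy and a small made-up data matrix whose two columns have equal variance, so the loadings come out exactly as above, up to an arbitrary sign flip):

```python
import numpy as np

# Made-up data whose two columns have equal variance and positive correlation,
# so the loading vectors are exactly (sqrt(2)/2, sqrt(2)/2) and (-sqrt(2)/2, sqrt(2)/2)
# up to sign.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

Xc = X - X.mean(axis=0)               # center each column
S = np.cov(Xc, rowvar=False)          # 2 x 2 sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
Phi = eigvecs[:, np.argsort(eigvals)[::-1]]  # columns phi_1, phi_2, sorted by variance

print(Phi[:, 0])  # first loading vector:  +/- [0.707, 0.707]
print(Phi[:, 1])  # second loading vector: +/- [-0.707, 0.707]
```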

PCA: details

  • If we project the \(n\) data points \(\mathbf{x}_1,\dots, \mathbf{x}_n\) onto this direction, the projected values are the first principal component scores \(z_{11},\dots, z_{n1}\)
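Continuing the sketch above (the centered matrix `Xc` and loading matrix `Phi` are reused from it), the scores are just the coordinates of each observation along the loading vectors:

```python
# Project the centered observations onto the loading vectors:
# column m of Z holds the PC-m scores z_{1m}, ..., z_{nm}.
Z = Xc @ Phi
print(Z[:, 0])  # PC1 scores of the four observations
```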

PCA: details

  • The second principal component is the linear combination of \(X_1,\dots,X_p\) that has maximal variance among all linear combinations that are uncorrelated with \(Z_1\)
  • It turns out that constraining \(Z_2\) to be uncorrelated with \(Z_1\) is equivalent to constraining the direction \(\boldsymbol{\varphi}_2\) to be orthogonal to the direction \(\boldsymbol{\varphi}_1\)
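A quick numerical check of this equivalence, again reusing `Phi` and `Z` from the sketches above:

```python
print(Phi[:, 0] @ Phi[:, 1])          # ~ 0: the two directions are orthogonal
print(np.cov(Z, rowvar=False)[0, 1])  # ~ 0: Cov(Z1, Z2) = 0, so the scores are uncorrelated
```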

PCA directions, scores

  • PC loadings (directions):
    • The PC1 direction is the direction along which the data has the largest variance
    • The PCm direction is the direction along which the data has the largest variance, among all directions orthogonal to the first \(m - 1\) PC directions
  • PC scores:
    • The PCm score for the observation \(\mathbf{x}_{i}\) is the position of \(\mathbf{x}_{i}\) along the \(m^{th}\) PC direction

Other interpretations of PCA

  • Best approximation interpretation:
    • PC1 score \(\times\) PC1 direction = the best 1-dimensional approximation to the data in terms of MSE
    • \(\sum_{m=1}^{M}\) PCm score \(\times\) PCm direction = the best \(M\)-dimensional approximation to the data in terms of MSE
  • Eigenvector interpretation:
    • The PCm direction is the \(m^{th}\) eigenvector (normalized to unit length) of the covariance matrix, sorting the eigenvectors by the size of their eigenvalues
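A small sketch of both interpretations, using made-up NumPy data (all names below are illustrative only): the PC directions are taken from the eigendecomposition of the covariance matrix, and the reconstruction built from the first \(M\) scores and directions is the best rank-\(M\) approximation of the centered data.

```python
import numpy as np

# Made-up data: n = 100 observations on p = 4 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
Xc = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))
Xc -= Xc.mean(axis=0)                 # center the columns

# Eigenvector interpretation: PC directions are the eigenvectors of the
# covariance matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Phi = eigvecs[:, np.argsort(eigvals)[::-1]]

# Best-approximation interpretation: sum of (PC-m score) x (PC-m direction)
# over m = 1, ..., M is the best rank-M approximation of the centered data.
M = 2
Z = Xc @ Phi[:, :M]                   # first M score vectors
X_hat = Z @ Phi[:, :M].T              # rank-M reconstruction
print(np.mean((Xc - X_hat) ** 2))     # small MSE: two PCs capture most of the variation
```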

Illustration

  • USArrests data: for each of the fifty states in the United States, the data set contains the number of arrests per \(100,000\) residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas)
  • The principal component score vectors have length \(n = 50\), and the principal component loading vectors have length \(p = 4\)
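A hedged sketch of this analysis in Python, assuming statsmodels (which can fetch R's USArrests data from the Rdatasets repository, so it needs network access) and scikit-learn:

```python
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fetch R's USArrests data (Murder, Assault, UrbanPop, Rape for the 50 states).
usarrests = sm.datasets.get_rdataset("USArrests").data

# Standardize the four variables, then compute all four principal components.
X = StandardScaler().fit_transform(usarrests)
pca = PCA(n_components=4)
scores = pca.fit_transform(X)  # 50 rows of PC scores (one per state)
print(pca.components_)         # four loading vectors, each of length p = 4
```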


Variable scaling

  • If the variables are in different units, scaling each to have standard deviation equal to one is recommended
  • If they are in the same units, you might or might not scale the variables
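As a small illustration of why scaling matters here (reusing the `usarrests` data frame fetched above), skipping the scaling step lets the variable with the largest raw variance dominate the first component:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

raw = PCA().fit(usarrests)
scaled = PCA().fit(StandardScaler().fit_transform(usarrests))
print(raw.explained_variance_ratio_)     # PC1 dominated by Assault (largest raw variance)
print(scaled.explained_variance_ratio_)  # variance spread more evenly after scaling
```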

Proportion of Variance

  • To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one
  • The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as \[\sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2\] and the variance explained by the \(m^{th}\) principal component is \[\text{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^n z_{im}^2\]

Proportion of Variance

The PVE of the \(m^{th}\) principal component is given by the positive quantity between 0 and 1 \[\dfrac{\sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2}\]
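A short numerical check of this formula, reusing the centered matrix `Xc` and loading matrix `Phi` from the earlier sketches (scikit-learn reports the same quantity as `explained_variance_ratio_`):

```python
import numpy as np

# Xc: centered n x p data matrix; Phi: p x p matrix of loading vectors (as above).
Z = Xc @ Phi                                   # all PC scores
pve = (Z ** 2).sum(axis=0) / (Xc ** 2).sum()   # PVE of each component
print(pve, pve.sum())                          # the PVEs sum to 1
```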

Principal Components Regression (PCR)

Principal Components Regression

  • We have seen methods of reducing model variance by:
    • using a less flexible model (with fewer parameters)
    • selecting a subset of predictors
    • regularization / shrinkage
  • Another approach: transform the predictors to a lower-dimensional space using PCA
  • Combining PCA with linear regression leads to principal components regression (PCR)

Principal Components Regression

PCR = PCA + linear regression:

  1. Choose how many PCs to use, say, M.

  2. Use PCA to define a new feature vector \(\mathbf{z}_{i}\) containing the PC1, …, PCM scores for \(\mathbf{x}_{i}\)

  3. Use least-squares linear regression with this model:

    \[Y_{i} = \mathbf{z}_{i}^\intercal \boldsymbol{\beta} + \varepsilon_{i}\]

PCR works well when the directions in which the original predictors vary most are the directions that are predictive of the outcome.
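A minimal scikit-learn sketch of these three steps; the data below are made up purely so the example runs, and \(M\) is fixed at 2 for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Made-up data purely so the sketch runs: 100 observations, 10 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

M = 2  # step 1: choose the number of principal components

# Steps 2 and 3: build the PC score features z_i, then run least squares on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))  # fitted values for the first five observations
```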

PCR versus other methods

PCR versus least-squares:

  • When \(M = p\), PCR = least-squares
  • PCR has higher bias but lower variance
  • PCR can handle \(p > n\)

PCR does not select a subset of predictors/features, and therefore it is more closely related to Ridge than Lasso.

The PCR dimensionality \(M\) can be chosen using cross-validation, as sketched below.
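One way to do this, sketched with `GridSearchCV` over the pipeline from the PCR sketch above (`X`, `y`, and `pcr` are reused from it):

```python
from sklearn.model_selection import GridSearchCV

# Try every possible number of components and keep the best 5-fold CV score.
grid = {"pca__n_components": list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pcr, grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the chosen M
```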

Question time